The scope of this project was to create a standalone R program that would examine input data from an online peer-to-peer marketplace by selecting influential factors for analyzing and applying descriptive statistical methods. The dataset would contain transaction level information related to loan and borrower data, interest rates, and borrower demographics. Then, the program would output the results from the data analysis in a report, in which identifies loan performance and potential risk factors in the data. I conducted an Exploratory Data Analysis (EDA) on one of the curated data sets, provided by Udacity, from Prosper, which is America’s first marketplace lending platform, with over $12 billion in funded loans as of June 2021.
This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information. The dataset contains loans created between 2005-Q4 and 2014-Q1 with a last updated date of 03/11/2014. Here’s a link to the variable definitions for this dataset.
The goal of this project is to create a program that would use descriptive analysis to evaluate peer-to-peer lending data. The program would return aggregations and derived values of key attributes using descriptive statistical methods and visualizations. The program will return any conclusions drawn or results of key indicators or factors that may have implications on key business decisions.
First, the following are the object names of the dataset.
## [1] "ListingKey" "ListingNumber"
## [3] "ListingCreationDate" "CreditGrade"
## [5] "Term" "LoanStatus"
## [7] "ClosedDate" "BorrowerAPR"
## [9] "BorrowerRate" "LenderYield"
## [11] "EstimatedEffectiveYield" "EstimatedLoss"
## [13] "EstimatedReturn" "ProsperRating..numeric."
## [15] "ProsperRating..Alpha." "ProsperScore"
## [17] "ListingCategory..numeric." "BorrowerState"
## [19] "Occupation" "EmploymentStatus"
## [21] "EmploymentStatusDuration" "IsBorrowerHomeowner"
## [23] "CurrentlyInGroup" "GroupKey"
## [25] "DateCreditPulled" "CreditScoreRangeLower"
## [27] "CreditScoreRangeUpper" "FirstRecordedCreditLine"
## [29] "CurrentCreditLines" "OpenCreditLines"
## [31] "TotalCreditLinespast7years" "OpenRevolvingAccounts"
## [33] "OpenRevolvingMonthlyPayment" "InquiriesLast6Months"
## [35] "TotalInquiries" "CurrentDelinquencies"
## [37] "AmountDelinquent" "DelinquenciesLast7Years"
## [39] "PublicRecordsLast10Years" "PublicRecordsLast12Months"
## [41] "RevolvingCreditBalance" "BankcardUtilization"
## [43] "AvailableBankcardCredit" "TotalTrades"
## [45] "TradesNeverDelinquent..percentage." "TradesOpenedLast6Months"
## [47] "DebtToIncomeRatio" "IncomeRange"
## [49] "IncomeVerifiable" "StatedMonthlyIncome"
## [51] "LoanKey" "TotalProsperLoans"
## [53] "TotalProsperPaymentsBilled" "OnTimeProsperPayments"
## [55] "ProsperPaymentsLessThanOneMonthLate" "ProsperPaymentsOneMonthPlusLate"
## [57] "ProsperPrincipalBorrowed" "ProsperPrincipalOutstanding"
## [59] "ScorexChangeAtTimeOfListing" "LoanCurrentDaysDelinquent"
## [61] "LoanFirstDefaultedCycleNumber" "LoanMonthsSinceOrigination"
## [63] "LoanNumber" "LoanOriginalAmount"
## [65] "LoanOriginationDate" "LoanOriginationQuarter"
## [67] "MemberKey" "MonthlyLoanPayment"
## [69] "LP_CustomerPayments" "LP_CustomerPrincipalPayments"
## [71] "LP_InterestandFees" "LP_ServiceFees"
## [73] "LP_CollectionFees" "LP_GrossPrincipalLoss"
## [75] "LP_NetPrincipalLoss" "LP_NonPrincipalRecoverypayments"
## [77] "PercentFunded" "Recommendations"
## [79] "InvestmentFromFriendsCount" "InvestmentFromFriendsAmount"
## [81] "Investors"
Prosper Loan Data contains 113,937 observations and 81 variables.
Next, the following is the initial dataframe’s structure.
## 'data.frame': 113937 obs. of 81 variables:
## $ ListingKey : chr "1021339766868145413AB3B" "10273602499503308B223C1" "0EE9337825851032864889A" "0EF5356002482715299901A" ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : chr "2007-08-26 19:09:29.263000000" "2014-02-27 08:28:07.900000000" "2007-01-05 15:00:47.090000000" "2012-10-22 11:02:35.010000000" ...
## $ CreditGrade : chr "C" "" "HR" "" ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : chr "Completed" "Current" "Completed" "Current" ...
## $ ClosedDate : chr "2009-08-14 00:00:00" "" "2009-12-17 00:00:00" "" ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : chr "" "A" "" "A" ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : chr "CO" "CO" "GA" "GA" ...
## $ Occupation : chr "Other" "Professional" "Other" "Skilled Labor" ...
## $ EmploymentStatus : chr "Self-employed" "Employed" "Not available" "Employed" ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : chr "True" "False" "False" "True" ...
## $ CurrentlyInGroup : chr "True" "False" "True" "False" ...
## $ GroupKey : chr "" "" "783C3371218786870A73D20" "" ...
## $ DateCreditPulled : chr "2007-08-26 18:41:46.780000000" "2014-02-27 08:28:14" "2007-01-02 14:09:10.060000000" "2012-10-22 11:02:32" ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : chr "2001-10-11 00:00:00" "1996-03-18 00:00:00" "2002-07-27 00:00:00" "1983-02-28 00:00:00" ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : chr "$25,000-49,999" "$50,000-74,999" "Not displayed" "$25,000-49,999" ...
## $ IncomeVerifiable : chr "True" "True" "True" "True" ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : chr "E33A3400205839220442E84" "9E3B37071505919926B1D82" "6954337960046817851BCB2" "A0393664465886295619C51" ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : chr "2007-09-12 00:00:00" "2014-03-03 00:00:00" "2007-01-17 00:00:00" "2012-11-01 00:00:00" ...
## $ LoanOriginationQuarter : chr "Q3 2007" "Q1 2014" "Q1 2007" "Q4 2012" ...
## $ MemberKey : chr "1F3E3376408759268057EDA" "1D13370546739025387B2F4" "5F7033715035555618FA612" "9ADE356069835475068C6D2" ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
Then, I converted the credit scores’ lower and upper ranges into a single credit score by taking the average of the two values.
# Created new variable by converting CreditScoreRangeLower and
# CreditScoreRangeUpper into a single value (the average of the two)
loandata <- loandata %>%
mutate(CreditScore = CreditScoreRangeLower / 2 + CreditScoreRangeUpper / 2)
names(loandata)
## [1] "ListingKey" "ListingNumber"
## [3] "ListingCreationDate" "CreditGrade"
## [5] "Term" "LoanStatus"
## [7] "ClosedDate" "BorrowerAPR"
## [9] "BorrowerRate" "LenderYield"
## [11] "EstimatedEffectiveYield" "EstimatedLoss"
## [13] "EstimatedReturn" "ProsperRating..numeric."
## [15] "ProsperRating..Alpha." "ProsperScore"
## [17] "ListingCategory..numeric." "BorrowerState"
## [19] "Occupation" "EmploymentStatus"
## [21] "EmploymentStatusDuration" "IsBorrowerHomeowner"
## [23] "CurrentlyInGroup" "GroupKey"
## [25] "DateCreditPulled" "CreditScoreRangeLower"
## [27] "CreditScoreRangeUpper" "FirstRecordedCreditLine"
## [29] "CurrentCreditLines" "OpenCreditLines"
## [31] "TotalCreditLinespast7years" "OpenRevolvingAccounts"
## [33] "OpenRevolvingMonthlyPayment" "InquiriesLast6Months"
## [35] "TotalInquiries" "CurrentDelinquencies"
## [37] "AmountDelinquent" "DelinquenciesLast7Years"
## [39] "PublicRecordsLast10Years" "PublicRecordsLast12Months"
## [41] "RevolvingCreditBalance" "BankcardUtilization"
## [43] "AvailableBankcardCredit" "TotalTrades"
## [45] "TradesNeverDelinquent..percentage." "TradesOpenedLast6Months"
## [47] "DebtToIncomeRatio" "IncomeRange"
## [49] "IncomeVerifiable" "StatedMonthlyIncome"
## [51] "LoanKey" "TotalProsperLoans"
## [53] "TotalProsperPaymentsBilled" "OnTimeProsperPayments"
## [55] "ProsperPaymentsLessThanOneMonthLate" "ProsperPaymentsOneMonthPlusLate"
## [57] "ProsperPrincipalBorrowed" "ProsperPrincipalOutstanding"
## [59] "ScorexChangeAtTimeOfListing" "LoanCurrentDaysDelinquent"
## [61] "LoanFirstDefaultedCycleNumber" "LoanMonthsSinceOrigination"
## [63] "LoanNumber" "LoanOriginalAmount"
## [65] "LoanOriginationDate" "LoanOriginationQuarter"
## [67] "MemberKey" "MonthlyLoanPayment"
## [69] "LP_CustomerPayments" "LP_CustomerPrincipalPayments"
## [71] "LP_InterestandFees" "LP_ServiceFees"
## [73] "LP_CollectionFees" "LP_GrossPrincipalLoss"
## [75] "LP_NetPrincipalLoss" "LP_NonPrincipalRecoverypayments"
## [77] "PercentFunded" "Recommendations"
## [79] "InvestmentFromFriendsCount" "InvestmentFromFriendsAmount"
## [81] "Investors" "CreditScore"
Next, I will use a subset of this dataset for performing an analysis. The main purpose of this analysis is to explore the various factors that affect borrowers’ rates. The following are the variables I would like to use.
## [1] "LoanNumber" "LoanOriginalAmount"
## [3] "LoanOriginationDate" "LoanOriginationQuarter"
## [5] "Term" "LoanStatus"
## [7] "BorrowerRate" "LenderYield"
## [9] "ProsperRating..Alpha." "ProsperScore"
## [11] "ListingCategory..numeric." "EmploymentStatus"
## [13] "IsBorrowerHomeowner" "CreditScore"
## [15] "CreditScoreRangeLower" "CreditScoreRangeUpper"
## [17] "InquiriesLast6Months" "PublicRecordsLast10Years"
## [19] "IncomeRange" "IncomeVerifiable"
Based on the variables in this dataset, I will transform Term and ListingCategory columns from ‘numeric’ to ‘factor’. Then, I will transform the ListingCreationDate column from ‘factor’ to ‘Date’. The following is the dataset’s structure transformed.
## 'data.frame': 113937 obs. of 20 variables:
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : chr "2007-09-12 00:00:00" "2014-03-03 00:00:00" "2007-01-17 00:00:00" "2012-11-01 00:00:00" ...
## $ LoanOriginationQuarter : chr "Q3 2007" "Q1 2014" "Q1 2007" "Q4 2012" ...
## $ Term : Factor w/ 3 levels "1 year","3 years",..: 2 2 2 2 2 3 2 2 2 2 ...
## $ LoanStatus : chr "Completed" "Current" "Completed" "Current" ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ ProsperRating : Factor w/ 8 levels "AA","A","B","C",..: NA 2 NA 2 5 3 6 4 1 1 ...
## $ ProsperScore : Factor w/ 11 levels "1","2","3","4",..: NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory : Factor w/ 21 levels "NA","Debt Consolidation",..: 1 3 1 17 3 2 2 3 8 8 ...
## $ EmploymentStatus : chr "Self-employed" "Employed" "Not available" "Employed" ...
## $ IsBorrowerHomeowner : chr "True" "False" "False" "True" ...
## $ CreditScore : num 650 690 490 810 690 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ PublicRecordsLast10Years: int 0 1 0 0 0 0 0 1 0 0 ...
## $ IncomeRange : chr "$25,000-49,999" "$50,000-74,999" "Not displayed" "$25,000-49,999" ...
## $ IncomeVerifiable : chr "True" "True" "True" "True" ...
In this section, I will perform some preliminary exploration of the transformed dataset. I will run some summaries of the data and create univariate plots to understand the structure of the individual variables in this dataset.
First, let’s take a look on various categorical variables starting with Term.
## .
## 1 year 3 years 5 years
## 0.01416572 0.77040821 0.21542607
Above 90% of the loans in this dataset are with term more than 1 year. Most of them is of 3 years.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2005 2008 2012 2011 2013 2014
The chart took a plunge in 2009 and exploded the year after peaking in 2013.
## .
## Cancelled Chargedoff Completed
## 0.0000438839 0.1052511476 0.3341671274
## Current Defaulted FinalPaymentInProgress
## 0.4965551138 0.0440418828 0.0017992399
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 0.0001404285 0.0070740848 0.0023258467
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 0.0031859712 0.0027471322 0.0026681412
There are four major loan status: “Defaulted”, “Current”, “Completed”, and “Chargedoff.”
## .
## AA A B C D E HR
## 0.06330949 0.17148480 0.18362344 0.21619742 0.16822033 0.11543493 0.08172958
## NA
## 0.00000000
The above bar chart shows Prosper’s rating distribution. The unrated loans are removed in this plot.
## .
## Employed Full-time Not available Not employed
## 0.019791639 0.590870393 0.231312041 0.046929443 0.007328611
## Other Part-time Retired Self-employed
## 0.033404425 0.009549137 0.006977540 0.053836769
The majority of Prosper users are either employed or have a full-time job.
About half of the users are homeowners.
## .
## $0 $1-24,999 $100,000+ $25,000-49,999 $50,000-74,999
## 0.005450380 0.063842299 0.152163037 0.282542107 0.272519024
## $75,000-99,999 Not displayed Not employed
## 0.148468013 0.067941055 0.007074085
About 70% of borrowers’ income are in the range of $25,000 - $100,000
Most of borrowers have their income verified.
Next, let’s take a look on various numerical variables starting with BorrowerRate.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1340 0.1840 0.1928 0.2500 0.4975
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0100 0.1242 0.1730 0.1827 0.2400 0.4925
The BorrowerRate has a right-skewed distribution, with a mean of 19.28%, and median of 18.40%. Similarly, the LenderYield also has a right-skewed distribution, with a mean of 18.27%, and median of 17.30%.
The ranges of credit scores are small compared to the distributions. I will use the created credit score variable that takes the averages of the two ranges for further analyses.
## [1] 3 0 1 7 2 5 4 11 6 9 44 NA 8 12 19 42 27 10 15
## [20] 13 21 17 38 16 14 18 22 25 28 20 23 26 29 24 40 41 30 33
## [39] 31 32 34 35 36 63 97 37 46 52 105 53 50
The distribution after adding a scale_x_log10() layer of InquiriesLast6Months is right-skewed and long-tailed. Outliers like 105 and 53 exist.
## [1] 0 1 2 5 NA 3 4 7 6 11 8 12 9 10 15 21 13 25 38 14 16 30 20 34 22
## [26] 17
The distribution after adding a scale_x_log10() layer of PublicRecordsLast10Years is right-skewed and has some outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
LoanOriginalAmount also displays a long-tailed distribution, with some numbers especially higher. The reason why these numbers are more frequent might be because they are the exact numbers (eg. 10,000, 15,000, 20,000) that people tend to choose with a loan original amount minimum of $1,000 and maximum of $35,000
This is a very large dataset with 113937 records and 81 variables on each loan and covers the period from 2005 through 2014. For simplicity with the analysis, I use a subset of the original dataset containing the same number of records, but only with 15 variables that I want to explore. The 15 variables are as follows:
Categorical:
* Term
* LoanStatus
* ProsperRating
* ListingCategory
* EmploymentStatus
* IsBorrowerHomeowner
* IncomeRange
* IncomeVerifiable
Numerical:
* BorrowerRate
* LenderYield
* CreditScoreRangeLower
* CreditScoreRangeUpper
* CreditScore
* InquiriesLast6Months
* PublicRecordsLast10Years
* LoanOriginalAmount
The main features of interest in this dataset are BorrowerRate and ProsperRating, where ProsperRating might be the overall assessment and main quality indicators of borrowers performed by Prosper. Also, I’ve seen evidence that LenderYield is the most important factor for investors.
Features om the dataset that I think will help support my investigation into features of interest are variables regarding borrower’s status (i.e.EmploymentStatus, IsBorrowerHomeowner, etc.) as well as those that reveal borrower’s past payment history (i.e. PublicRecords, Inquiries, etc.). Also, I did not find anything unusual about IncomeRange and IncomeVerifiable, variables that I was expecting more from. On the other hand, maybe this variable have more to tell when related to others.
I factorized a few variables because those variables are either intrinsically categorical or they only have several possible values.
Of the features you investigated, I observed some unusual distributions in the LoanOriginalAmount histogram. The frequencies are higher around 4,000, 5,000, 10,000, 15,000 and so on. I believe these amounts are more common for most borrowers to choose from.
I did not perform any operations on the dataset, as the data was tidy, Also, I did not perform other transformation other than changing their classes (e.g. numeric -> factor ) since categorical variables enter into statistical models differently than continuous variables. Thus, storing data as factors ensures that the modeling functions will treat such data accordingly.
The correlation matrix shows that there are two strong positive relationships among the features. The trivial features are
CreditScoreRangeUpperandCreditScoreRangeLower. Next are theBorrowerRateandLenderYield.
## [1] 0.9992113
The
BorrowerRateand theLenderYieldhave a near perfect linear relationship. This is a result from Prosper charging a fixed rate on loans. Therefore, theBorrowerRateand theLenderYieldshould have a strong positive relationship.
This boxplot shows that
ProsperRatingdoes affect theBorrowerRateand its relationship.
Now, I want to inspect other factors that also have impact on the BorrowerRate and the ProsperRating. Let’s look at the BorrowerRate next.
## $`1 year`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0400 0.0929 0.1434 0.1501 0.2064 0.2669
##
## $`3 years`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1274 0.1815 0.1935 0.2599 0.4975
##
## $`5 years`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0669 0.1490 0.1870 0.1930 0.2319 0.3304
The average rate of the 3-year term is larger than the 1-year term. However, these terms are very close to the 5-year term average. Also, loans with a 5-year term average have a smaller range than the 3-year term group.
## [1] -0.3289599
This graph shows that there is a fairly weak negative relationship between
BorrowerRateandOriginalLoanAmount.
It seems that some categories have lower rates and others have higher rates.
Borrowers that are income verifiable or homeowners seem to have lower rates, which makes sense.
Finally, let’s take a look at how ProsperRating affect the aforementioned features.
## AA A B C D E HR NA NA's
## 5372 14551 15581 18345 14274 9795 6935 0 29084
The majority of loans in this dataset are non-classified (i.e.’NA’), but, among those that are classified, loans with a ‘C’ rating is the highest.
A higher credit score by the consumer rating agency also tends to have a higher
ProsperRating.
For top
ProsperRatinggroups like “AA” and “A”, Homeowners are the majority.
This chart shows evidence that it is difficult to get a good rating if a borrower’s income not verifiable
First, the LenderYield and BorrowerRate are strongly related because the fixed rate charged by Prosper. Second, the BorrowerRate and ProsperRating do vary with borrower status. In addition, a good relationships I found was through the proprietary Prosper Rating system.
The relationship between ListCategory and BorrowerRate is interesting. In average, for this relationship, cosmetic procedure’s rates are relatively higher, whereas boat rates are lower.
The strongest relationship I found is the one between the LenderYield and BorrowerRate, with a correlation coefficient very close to 1. In addition, the relationship between ProsperRating / LenderYield and ProsperRating / BorrowerRate have an inverse relationship. The higher the rating, the lower the LenderYield and BorrowerRate.
Now let’s put everything together based on what I found in the bivariate plots section by creating a few multivariate plots to investigate more complex interactions between variables.
Using jitter plot, we can visualize the real distribution of
BorrrowerRaterelative toProsperRating. In addition, I use different colors to indicate whether the individual borrower is a homeowner. This plot shows that most of the “AA” rated borrowers are also homeowners.
In this plot, I found evidence that each rating has its own colored area that is easily recognized, which indicates that
ProsperRatingis a good predictor for theBorrowerRate. Also, highly rated loans (i.e. “AA”, “A”, and “B”) can have loan original amount over $30,000. On the other hand, most lowly rated loans (i.e. “E” and “HR”) are below $10,000.
## # A tibble: 6 x 3
## # Groups: ProsperRating [1]
## ProsperRating ListingCategory BorrowerRate
## <fct> <fct> <dbl>
## 1 AA NA 0.0655
## 2 AA Debt Consolidation 0.0800
## 3 AA Home Improvement 0.0789
## 4 AA Business 0.0823
## 5 AA Student Use 0.0821
## 6 AA Auto 0.0759
This plot shows the relative value of the average
BorrowerRatesof differentListCategory.
Compared to the previous plot, the boxplots give us more information about the outliers and range of the distribution. However, it’s more difficult to compare mean values between different
Listingcategory.
Based on this plot, ‘Boat’, ‘Green loans’, and ‘RV’ categories have less data points. As a result, these catefories have less outliers as seen in the previous plots.
The heat map shows that highly rated “Not employed” borrowers have to pay slight higher rates. The jitter plots shows that even “Not employed”, “Retired”, or “Part-time” borrowers can get loan with rates lower than 0.2.
##
##
## mean
## -------- ----------
## 1 year 0.1500807
## 3 years 0.1934855
## 5 years 0.1929907
##
##
## standard deviation
## -------- -------------------
## 1 year 0.0678582
## 3 years 0.0792523
## 5 years 0.0556659
This chart shows a closer look at
BorrowerRateandProsperRatingin which most loans have either a 3-year or 5-year term with higher borrower rates. On the other hand, the main difference between 3-year and 5-year term loans is deviation.
There’s evidence that Prosper optimized their rating model throughout the year, as we see the borrower and the variation between borrower rate is not that significant anymore. As a result, we tend to have smaller standard deviation year-over-year. Something to note is the amount of borrowing that suddenly decreased in 2013.
A feature that strengthened each other in terms of looking at the features of interest is Term, as this is a good indicator whether we have a better BorrowerRate or not. In addition, we see how variables like BorrowerRate and ProsperRating come together and how it affect each other.
The criteria for being an AA borrowers seems to be tighten year-over-year. Also, there seems to be a fixed borrower rate for both the HR and AA criteria.
No, I did not create any models with this dataset.
At this point, I’ve done some exploration and have built up an understanding of the structure of and relationships between the variables in this Prosper Loan dataset. Next, I will select three plots from all of my previous exploration to present here as a summary of some of my most interesting findings.
The Prosper rating is an important factor for the borrower rate. In other words, a high Prosper rating is linked to lower borrower rates and vice versa.
Based on this graph, highly rated loans (i.e. “AA”, “A”, and “B”) can have original amount over $30,000, and most lowly rated loans (i.e. “E” and “HR”) are under 10,000. As a results, Prosper rating is an great indicator of the borrower rate. As you can see, the different colored areas can be easily recognized.
This graph shows eveidence that a 3-year term loan will have higher borrower rates when compared to a 1-year term loan in each Prosper rating. In addition, loans with a 5-year term are more concentrated than a 3-year term loan and has a smaller total range as well.
Regarding my initial observations on the Prosper loan dataset containing up to 81 variables, I was considering changing datasets because I believed the high variable count would make it hard to explore and analyze. However, after examining the dataset’s variable dictionary that explained the variables in the data set, I was able to determine which variables might be interesting to explore. Furthermore, I researched and explored each variable primarily focusing on features related to the borrower rates and statuses. Next, before diving into my exploratory data analysis, I took a subset of the loan data, creating a smaller dataset in which reduced the variable count to 15.
In my updated dataset, there were only 7 numerical variables. Initially, my primary focus was to start analyzing the borrower rate and the various borrower statuses. However, I decided to go through each variable one by one, trying to understand any relationships between the variables in the dataset since I’m currently working in the financial industry. As a result, most of the data visualizations I created included at least one categorical variable.
I did spend a lot of time researching how to create visualization with one, two, or more categorical variables. One of my main struggles, which is not uncommon, was the material that I’ve learned up to this point were outdated, so I ended up doing additional research on the latest and up-to-date libraries and functions needed to complete this project. Based on my research and what I’ve learned, my solution was to utilize boxplots, bar charts, jitter plots, heat maps, and a variety of libraries/functions to create the visualizations needed for this project.
In conclusion of my project, my analysis shows that Prosper rating and term are two major indicators of the borrower rate. In addition, borrower statuses also have impacts on borrower rate, but their effects are either small or are also factors of Prosper rating. Overall, I think this project would be a good reference for anyone who wants to borrower money from a P2P lending platform. For future work and to expand the project, I would extend testing on this program by allowing organizations to extrapolate the results to a broader population to gain better insights into their data. Another course of action is to try to relate various variables and make a predictive model using logistic regression for binary yes or no investment decision or any classification algorithm that would align with their vision and mission statement. Overall, I enjoyed this project. I will say that my exploratory data analysis did give me better insight into the lending business.